This project to predict the success of startups is conducted according to the Data Science Lifecycle.

1 Plan

1.1 Identify Use Case

The objective of this project is to predict the success of startups. A startup is a new company founded by an entrepreneur with the intent to grow beyond the solo founder (according to Wikipedia). Startups play a major role in the economy, as they foster innovation and create employment as they grow. Yet, they face high uncertainty and need to find investors to pursue their ideas and expand their potential. This project supports investors in finding companies with great potential to invest in, thus putting them one step ahead of the competition. A business model canvas is used to describe the use case in more detail.

  1. Customer Segments: The customers of this use case are investors who want to investigate the startup landscape quickly and make the best possible investment. The investors’ biggest pain is the uncertainty of investing in a startup. Even with a well-developed business plan and promising market research, there is no guarantee that the startup will be successful. Furthermore, startups are inherently risky ventures. They often have limited operating history and may face significant challenges in scaling up their operations, acquiring customers, and generating revenue. Moreover, investing in a startup requires a significant time commitment. Investors need to research potential opportunities, evaluate business plans, and actively monitor their investments. Their gains include high returns if the startup is successful. Especially early-stage investors have the opportunity to purchase equity at a low valuation, which can result in a significant return on investment if the company is acquired or goes public. In addition, investing in startups can help diversify an investor’s portfolio. Startups can provide exposure to new market segments that may be difficult to access through traditional investments. Furthermore, some investors may be motivated by the social impact that startups can have. By investing in innovative companies that are addressing social or environmental challenges, investors can support positive change while also generating financial returns.

  2. Value Proposition: This project addresses all of these pain points by providing a detailed analysis of many variables of former startups and transferring these learnings to new investments. It highlights the most important features a startup needs to become successful and predicts whether a currently operating startup will turn into a success or a failure. The success of a company is defined as the event that gives the company’s founders a large sum of money through an M&A (Merger and Acquisition) or an IPO (Initial Public Offering). A company is considered failed if it had to be shut down. This minimizes the uncertainty and risk of the investment. As all of this is provided in one platform, the time commitment and intense research are reduced tremendously. The data can always be extended with more startup data provided by the customers. This has the advantage that possible changes in the features of successful startups are captured and the use case always provides the latest and most valuable information.

  3. Channels: Specific investors will be contacted directly to learn about this startup success prediction and how to purchase it. In addition, the use case will be advertised on business platforms and online newspapers.

  4. Customer Relationships: A close customer collaboration will be established, so that the investors receive the startup success predictions and also share information about their past investments, keeping the data up to date and continuously improving it.

  5. Revenue Streams: The startup success prediction can be purchased by investors as a yearly subscription. They can get a reduction of the fee in exchange for startup investment data that is incorporated into the analysis and prediction.

  6. Key Activities: The most important activity is always having a well optimized classification model for the startup success prediction. Another key activity is the data analysis that is used as a base for the prediction model. The third activity is to create a good user experience to visualize the key learnings and recommendations.

  7. Key Resources: The project is conducted by Helena Schick. Jan Kirenz can be consulted for advice.

  8. Key Partnerships: The key partnerships are with the customers themselves, as they provide data to improve the model.

  9. Cost Structure: There are currently no costs, as this is done voluntarily on behalf of the module “Programming Languages for Data Science”.

1.2 Frame Problem

We are investigating the characteristics of startups, because we want to find out if a startup will be successful, in order to decide whether to invest in it or not.

We want the model to classify the status of a startup, where our ideal outcome is “acquired” (meaning the startup was successful), in order to reduce the uncertainty in startup investments.

1.3 Identify Variables

For structured data problems, we need to identify potentially relevant variables. The response variable is the categorical status (acquired or closed; if a startup is acquired by another organization, it succeeded) and there are many explanatory variables, including the category, location, funding rounds and funding amount. The data set is from Kaggle and contains data about startups in the United States of America since the late 1990s.

1.4 Define Metrics

Our success metrics are accuracy (the percentage of startups that the model correctly predicts as successful or unsuccessful) and precision (indicating that the model makes few false positive predictions). Our key results for the success metrics are an accuracy and precision of more than 80%. Our project is deemed a failure if the accuracy and precision are lower than 80%.
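With the tidymodels stack used later in this notebook, these success metrics can be bundled into a metric set for evaluation; a minimal sketch using yardstick (the object name class_metrics is an assumption, not part of the project code):

```r
library(yardstick)

# bundle the two success metrics defined above; they are later
# computed on the test-set predictions
class_metrics <- metric_set(accuracy, precision)
```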

2 Data

2.1 Data Ingestion

2.1.1 Prepare Environment

At first, R is set up.

Then the necessary libraries for the entire notebook are loaded; they have been installed beforehand.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(conflicted)
library(dplyr)
library(visdat)
library(rsample)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.3     ✔ recipes      1.0.6
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
library(skimr)
library(purrr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(DBI)
library(maps)
library(mapproj)
library(caret)
## Loading required package: lattice
library(xgboost)
library(kknn)
library(vip)
library(tidy.outliers)

2.1.2 Import Data

Currently the data is imported from a csv file; this could be changed to an import from a database.

path <- "/Users/helena.schick/Documents/GitHub/r-sql-startup-success/startup data.csv"

df <- read_csv(path)
## Rows: 923 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): state_code, zip_code, id, city, Unnamed: 6, name, founded_at, clos...
## dbl (35): Unnamed: 0, latitude, longitude, labels, age_first_funding_year, a...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
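The database import mentioned above could be sketched with DBI; the database file name and table name here are hypothetical:

```r
library(DBI)
library(RSQLite)

# hypothetical database file and table name
con_db <- dbConnect(RSQLite::SQLite(), "startup_data.sqlite")
df <- dbReadTable(con_db, "startups") %>% as_tibble()
dbDisconnect(con_db)
```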

2.1.3 Data Structure

Now we take a first look at the data, including the column names, data types and their content.

glimpse(df)
## Rows: 923
## Columns: 49
## $ `Unnamed: 0`             <dbl> 1005, 204, 1001, 738, 1002, 379, 195, 875, 16…
## $ state_code               <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ latitude                 <dbl> 42.35888, 37.23892, 32.90105, 37.32031, 37.77…
## $ longitude                <dbl> -71.05682, -121.97372, -117.19266, -122.05004…
## $ zip_code                 <chr> "92101", "95032", "92121", "95014", "94105", …
## $ id                       <chr> "c:6669", "c:16283", "c:65620", "c:42668", "c…
## $ city                     <chr> "San Diego", "Los Gatos", "San Diego", "Cuper…
## $ `Unnamed: 6`             <chr> NA, NA, "San Diego CA 92121", "Cupertino CA 9…
## $ name                     <chr> "Bandsintown", "TriCipher", "Plixi", "Solidco…
## $ labels                   <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, …
## $ founded_at               <chr> "1/1/2007", "1/1/2000", "3/18/2009", "1/1/200…
## $ closed_at                <chr> NA, NA, NA, NA, "10/1/2012", "2/15/2009", NA,…
## $ first_funding_at         <chr> "4/1/2009", "2/14/2005", "3/30/2010", "2/17/2…
## $ last_funding_at          <chr> "1/1/2010", "12/28/2009", "3/30/2010", "4/25/…
## $ age_first_funding_year   <dbl> 2.2493, 5.1260, 1.0329, 3.1315, 0.0000, 4.545…
## $ age_last_funding_year    <dbl> 3.0027, 9.9973, 1.0329, 5.3151, 1.6685, 4.545…
## $ age_first_milestone_year <dbl> 4.6685, 7.0055, 1.4575, 6.0027, 0.0384, 5.002…
## $ age_last_milestone_year  <dbl> 6.7041, 7.0055, 2.2055, 6.0027, 0.0384, 5.002…
## $ relationships            <dbl> 3, 9, 5, 5, 2, 3, 6, 25, 13, 14, 22, 8, 0, 15…
## $ funding_rounds           <dbl> 3, 4, 1, 3, 2, 1, 3, 3, 3, 3, 3, 5, 1, 3, 5, …
## $ funding_total_usd        <dbl> 375000, 40100000, 2600000, 40000000, 1300000,…
## $ milestones               <dbl> 3, 1, 2, 1, 1, 1, 2, 3, 4, 4, 3, 2, 0, 3, 1, …
## $ state_code.1             <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA…
## $ is_CA                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, …
## $ is_NY                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ is_MA                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ is_TX                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_otherstate            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ category_code            <chr> "music", "enterprise", "web", "software", "ga…
## $ is_software              <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ is_web                   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ is_mobile                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, …
## $ is_enterprise            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_advertising           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_gamesvideo            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_ecommerce             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_biotech               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_consulting            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_othercategory         <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ object_id                <chr> "c:6669", "c:16283", "c:65620", "c:42668", "c…
## $ has_VC                   <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, …
## $ has_angel                <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ has_roundA               <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
## $ has_roundB               <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, …
## $ has_roundC               <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ has_roundD               <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ avg_participants         <dbl> 1.0000, 4.7500, 4.0000, 3.3333, 1.0000, 3.000…
## $ is_top500                <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
## $ status                   <chr> "acquired", "acquired", "acquired", "acquired…

The column Unnamed: 0 seems to be some kind of id, but has no further value and will be deleted. labels and avg_participants cannot be understood without further context. Unnamed: 6 is a combination of the zip code and city and thus redundant; it will be deleted. The same redundancy applies to state_code.1 and object_id. For the categories and states there is each a categorical variable as well as several one-hot encoded ones. All are kept at this point, because the categorical variables are better for visualizations and data understanding, while the one-hot encoded ones are needed for the classification model. The variables are described in more detail:

Variable Description Data Type
state_code Abbreviation of US state of startup’s location character
latitude Geographic North-South location of startup’s location double
longitude Geographic East-West location of startup’s location double
zip_code Zip code of the city of startup’s location character
id Unique identification of startup string
city Name of city of startup’s location string
name Name of startup string
founded_at Founding date of startup (MM/DD/YYYY) date
closed_at Closing date of startup (MM/DD/YYYY) date
first_funding_at Date of first funding received of startup (MM/DD/YYYY) date
last_funding_at Date of last funding received of startup (MM/DD/YYYY) date
age_first_funding_year Age of startup when receiving first funding double
age_last_funding_year Age of startup when receiving last funding double
age_first_milestone_year Age of startup when achieving first milestone double
age_last_milestone_year Age of startup when achieving last milestone double
relationships Amount of relationships of startup with investors, mentors,… double
funding_rounds Amount of funding rounds double
funding_total_usd Total amount of funding received in USD double
milestones Amount of milestones achieved double
is_CA Startup’s location is in CA (1 = true, 0 = false) double
is_NY Startup’s location is in NY (1 = true, 0 = false) double
is_MA Startup’s location is in MA (1 = true, 0 = false) double
is_TX Startup’s location is in TX (1 = true, 0 = false) double
is_otherstate Startup’s location is in another state than CA, NY, MA or TX (1 = true, 0 = false) double
category_code Business category of startup string
is_software Startup’s category is software (1 = true, 0 = false) double
is_web Startup’s category is web (1 = true, 0 = false) double
is_mobile Startup’s category is mobile (1 = true, 0 = false) double
is_enterprise Startup’s category is enterprise (1 = true, 0 = false) double
is_advertising Startup’s category is advertising (1 = true, 0 = false) double
is_gamesvideo Startup’s category is gamesvideo (1 = true, 0 = false) double
is_biotech Startup’s category is biotech (1 = true, 0 = false) double
is_ecommerce Startup’s category is ecommerce (1 = true, 0 = false) double
is_consulting Startup’s category is consulting (1 = true, 0 = false) double
is_othercategory Startup’s category is another category than the previously mentioned (1 = true, 0 = false) double
has_VC Startup is financed through Venture Capital (1 = true, 0 = false) double
has_angel Startup is financed through an angel investor (1 = true, 0 = false) double
has_roundA Startup has succeeded in first major funding round (1 = true, 0 = false) double
has_roundB Startup has succeeded in second major funding round (1 = true, 0 = false) double
has_roundC Startup has succeeded in third major funding round (1 = true, 0 = false) double
has_roundD Startup has succeeded in fourth major funding round (1 = true, 0 = false) double
is_top500 Startup is listed in the Top500 startups (1 = true, 0 = false) double
status Success variable of startup (acquired = successful, closed = unsuccessful) string

To understand the ratio of numerical to categorical variables and to view the number of missing values in the dataframe, we visualize it.

vis_dat(df)

The majority of variables are numeric. There are many missing values in the column Unnamed: 6, which will be deleted anyway, as well as in closed_at (which only applies if the startup has failed) and in age_first_milestone_year and age_last_milestone_year. We keep the latter two variables to gain insights about the startups for which they are available.

2.1.4 Data Corrections

At first, all variables of the data type character are converted into factors. This improves memory efficiency, allows explicit ordering of levels, eases the handling of missing values and integrates well with visualization and modeling functions.

# convert all character variables to factors 
df <- 
  df %>% 
  mutate(across(where(is.character), as.factor))

Secondly, the variables that represent dates are converted into the data type date.

# convert all date variables to actual dates 
df$founded_at <- 
  as.Date(df$founded_at, format = "%m/%d/%Y")
df$closed_at <- 
  as.Date(df$closed_at, format = "%m/%d/%Y")
df$first_funding_at <- 
  as.Date(df$first_funding_at, format = "%m/%d/%Y")
df$last_funding_at <- 
  as.Date(df$last_funding_at, format = "%m/%d/%Y")

Finally, unnecessary variables are removed.

# remove unnecessary variables
df <- 
  dplyr::select(df, -`Unnamed: 0`, -`Unnamed: 6`, -state_code.1, -object_id, -labels, -avg_participants)

2.1.5 Variable List

For easier usage in data splitting, several variable lists are created.

# define outcome variable as y_label
y_label <- 'status'

# select feature names
features <- 
  df %>%
  select(-all_of(y_label)) %>%
  names()

# create feature data for data splitting
X <- 
  df %>%
  select(all_of(features))

# list of numeric feature names
feat_num <- 
  X %>% 
  select(where(is.numeric)) %>% 
  names()

# list of categorical feature names
feat_cat <- 
  X %>% 
  select(!where(is.numeric)) %>% 
  names()

# create response for data splitting
y <- 
  df %>% 
  select(all_of(y_label))

2.2 Data Splitting

2.2.1 Train and Test Split

The dataframe is split into a training and a test set. The training set will be used to train the model and the test set is used to verify how well the model performs on new data.

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible 
set.seed(42)

# Put 3/4 of the data into the training set 
data_split <- initial_split(df, 
                           prop = 3/4, 
                           strata = status, 
                           breaks = 4)

# Create dataframes for the two sets:
train_data <- training(data_split) 
test_data <- testing(data_split)
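Because the split is stratified by status, the class proportions should be similar in both sets; a quick check could look like this (a sketch, not part of the original pipeline):

```r
# compare the status distribution of the two sets; the proportions
# should be close due to stratified sampling
train_data %>% count(status) %>% mutate(prop = n / sum(n))
test_data %>% count(status) %>% mutate(prop = n / sum(n))
```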

2.2.2 Data Exploration Set

A copy of the training data is created for the data exploration to not alter the actual training data.

df_train <- train_data 

2.3 Analyze Data

Data analysis is done with R and SQL. It gives insights into the distribution of each feature, as well as patterns and correlations. At first, the scales of each feature are analyzed, including the quartiles, mean and median. Using skim, this also includes a rough histogram.

skim(df_train)
Data summary
Name df_train
Number of rows 691
Number of columns 43
_______________________
Column type frequency:
Date 4
factor 7
numeric 32
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
founded_at 0 1.00 1984-01-01 2013-04-16 2006-01-01 182
closed_at 442 0.36 2001-01-01 2013-10-30 2012-03-06 164
first_funding_at 0 1.00 2000-01-01 2013-11-19 2007-09-01 473
last_funding_at 0 1.00 2001-02-26 2013-11-19 2009-12-29 534

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state_code 0 1 FALSE 31 CA: 365, NY: 77, MA: 63, WA: 34
zip_code 0 1 FALSE 319 941: 26, 941: 19, 950: 15, 941: 14
id 0 1 FALSE 690 c:2: 2, c:1: 1, c:1: 1, c:1: 1
city 0 1 FALSE 184 San: 95, New: 65, Mou: 33, Pal: 27
name 0 1 FALSE 690 Red: 2, #wa: 1, 41s: 1, Abo: 1
category_code 0 1 FALSE 35 sof: 117, web: 111, mob: 56, ent: 53
status 0 1 FALSE 2 acq: 447, clo: 244

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latitude 0 1.00 38.58 3.86 25.75 37.39 37.78 40.73 5.934e+01 ▁▇▅▁▁
longitude 0 1.00 -103.57 22.66 -122.73 -122.20 -118.35 -77.39 1.806e+01 ▇▅▁▁▁
age_first_funding_year 0 1.00 2.17 2.46 -9.05 0.56 1.34 3.49 2.190e+01 ▁▇▃▁▁
age_last_funding_year 0 1.00 3.88 2.95 -9.05 1.59 3.50 5.56 2.190e+01 ▁▇▇▁▁
age_first_milestone_year 112 0.84 2.97 2.93 -14.17 0.91 2.58 4.57 2.468e+01 ▁▃▇▁▁
age_last_milestone_year 112 0.84 4.66 3.14 -3.88 2.41 4.39 6.69 2.468e+01 ▂▇▂▁▁
relationships 0 1.00 7.60 6.90 0.00 3.00 5.00 10.00 5.700e+01 ▇▂▁▁▁
funding_rounds 0 1.00 2.32 1.39 1.00 1.00 2.00 3.00 1.000e+01 ▇▃▁▁▁
funding_total_usd 0 1.00 27859192.08 218367604.85 11000.00 3000000.00 10400000.00 25065805.50 5.700e+09 ▇▁▁▁▁
milestones 0 1.00 1.84 1.33 0.00 1.00 2.00 3.00 8.000e+00 ▇▇▁▁▁
is_CA 0 1.00 0.53 0.50 0.00 0.00 1.00 1.00 1.000e+00 ▇▁▁▁▇
is_NY 0 1.00 0.11 0.31 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_MA 0 1.00 0.09 0.29 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_TX 0 1.00 0.04 0.20 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_otherstate 0 1.00 0.23 0.42 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▂
is_software 0 1.00 0.17 0.38 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▂
is_web 0 1.00 0.16 0.37 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▂
is_mobile 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_enterprise 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_advertising 0 1.00 0.07 0.25 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_gamesvideo 0 1.00 0.05 0.23 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_ecommerce 0 1.00 0.03 0.17 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_biotech 0 1.00 0.04 0.19 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_consulting 0 1.00 0.00 0.05 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_othercategory 0 1.00 0.32 0.47 0.00 0.00 0.00 1.00 1.000e+00 ▇▁▁▁▃
has_VC 0 1.00 0.32 0.47 0.00 0.00 0.00 1.00 1.000e+00 ▇▁▁▁▃
has_angel 0 1.00 0.25 0.43 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▂
has_roundA 0 1.00 0.51 0.50 0.00 0.00 1.00 1.00 1.000e+00 ▇▁▁▁▇
has_roundB 0 1.00 0.40 0.49 0.00 0.00 0.00 1.00 1.000e+00 ▇▁▁▁▅
has_roundC 0 1.00 0.24 0.43 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▂
has_roundD 0 1.00 0.09 0.29 0.00 0.00 0.00 0.00 1.000e+00 ▇▁▁▁▁
is_top500 0 1.00 0.81 0.39 0.00 1.00 1.00 1.00 1.000e+00 ▂▁▁▁▇

Dates:

The founded_at date ranges between 1984 and 2013. Its median is in 2006, which is why founding dates in the 20th century might be outliers. closed_at is strongly left-skewed, with the median in 2012 and the maximum in 2013, but the earliest closing in 2001. first_funding_at seems to be quite evenly distributed; the same applies to last_funding_at.

Categorical features:

There are startups in 31 states, but CA is most common. Dividing the location up into zip codes, there are 319 different ones in 184 cities. The name and id are almost unique, each having one duplicate, which needs to be removed. The startups are in 35 different categories and have 2 statuses.

Numerical features:

The latitude ranges between 25.75° and 59.33°; most startups are located at around 37° to 40°, as the median and mean lie in that range. The longitude ranges from -122.73° to 18.06° and is right skewed, because the first and second quartiles are at around -122° to -118°. The larger values might include some outliers. The latitude and longitude will be visualized on a map, because it is hard to imagine their distribution by looking at raw values.

The age_first_funding_year has a mean of 2.17 years and a median of 1.34 years, which indicates a right-skewed distribution. The maximum of over 20 years seems to be an outlier. A similar distribution applies to age_last_funding_year, also with a maximum of over 20 years. Both have negative minimums, which doesn’t make sense, as the funding would have taken place before the founding. These outliers need to be investigated. age_first_milestone_year and age_last_milestone_year have a similar distribution as the first and last funding ages. They also have negative minimums, which doesn’t make sense either, as the milestone would have taken place before the founding. These outliers need to be investigated as well.

Most startups have between 5 and 10 relationships; the minimum is 0 and the maximum 57. All startups in this dataframe have been through at least 1 funding round and on average through about 2. The maximum of 10 funding rounds seems to be an outlier, as usually there are only the 4 major funding rounds A to D. This needs to be investigated. funding_total_usd is right skewed, as the mean is almost three times the median. Outliers on the upper end need to be investigated. Most startups have between 0 and 3 milestones; the maximum is 8.

53% of the startups are located in CA, 11% in NY, 9% in MA and 4% in TX. 17% of the startups are in software and 16% in web. It doesn’t make sense to further consider these one-hot encoded variables in the numeric feature analysis. This also applies to the funding variables, but note that 32% of the startups have venture capital and 25% a business angel. 51% succeed in Round A funding, only 40% in Round B, 24% in Round C and only 9% in Round D. 81% made it into the Top500, though.

2.3.1 SQL Analysis

To use SQL in RStudio, a connection is created and the following code cells start with {sql connection=…} instead of {r}.

# Connection to database SQlite
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write data "df_train" into database
dbWriteTable(con, "df_train", df_train)

# List tables
dbListTables(con)
## [1] "df_train"

At first, we want to get an overview of the training data and have a look at the first 5 rows. Note that the date columns appear as numbers here, because SQLite stores dates as the number of days since 1970-01-01.

SELECT *
FROM df_train
LIMIT 5;
5 records
state_code latitude longitude zip_code id city name founded_at closed_at first_funding_at last_funding_at age_first_funding_year age_last_funding_year age_first_milestone_year age_last_milestone_year relationships funding_rounds funding_total_usd milestones is_CA is_NY is_MA is_TX is_otherstate category_code is_software is_web is_mobile is_enterprise is_advertising is_gamesvideo is_ecommerce is_biotech is_consulting is_othercategory has_VC has_angel has_roundA has_roundB has_roundC has_roundD is_top500 status
CA 42.35888 -71.05682 92101 c:6669 San Diego Bandsintown 13514 NA 14335 14610 2.2493 3.0027 4.6685 6.7041 3 3 375000 3 1 0 0 0 0 music 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 acquired
CA 37.23892 -121.97372 95032 c:16283 Los Gatos TriCipher 10957 NA 12828 14606 5.1260 9.9973 7.0055 7.0055 9 4 40100000 1 1 0 0 0 0 enterprise 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1 acquired
CA 32.90105 -117.19266 92121 c:65620 San Diego Plixi 14321 NA 14698 14698 1.0329 1.0329 1.4575 2.2055 5 1 2600000 2 1 0 0 0 0 web 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 acquired
CA 38.05711 -122.51374 94901 c:5192 San Rafael ClairMail 12418 NA 13019 14886 1.6466 6.7616 5.6055 7.3616 25 3 34100000 3 1 0 0 0 0 finance 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 acquired
NY 40.70276 -73.98667 11201 c:1491 Brooklyn drop.io 13843 NA 13850 14732 0.0192 2.4356 0.7945 4.3781 15 3 9950000 3 0 1 0 0 0 education 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 acquired

Count all observations with the status “acquired”.

SELECT COUNT(*)
FROM df_train
WHERE status = 'acquired';
1 records
COUNT(*)
447

There are 447 startups in the training data that are successful.

Count all observations with the status “closed”.

SELECT COUNT(*)
FROM df_train
WHERE status = 'closed';
1 records
COUNT(*)
244

There are 244 startups in the training data that are not successful. Neither class is so small that we need to worry about an uneven class distribution affecting model performance later on.

Which successful startups received the most funding? We also retrieve more information about these companies.

SELECT name, category_code, state_code, funding_total_usd, age_first_funding_year, funding_rounds
FROM df_train
WHERE status = "acquired"
ORDER BY funding_total_usd DESC
LIMIT 5;
5 records
name category_code state_code funding_total_usd age_first_funding_year funding_rounds
Clearwire mobile WA 5700000000 4.6000 4
Pearl Therapeutics biotech CA 232000100 2.5041 6
Luminus Devices other MA 135826373 4.3315 4
Tumblr web NY 125250000 0.7452 6
Dexterra mobile WA 122500000 5.6301 5

The startups that received the most funding are Clearwire, Pearl Therapeutics and Luminus Devices.

How much funding is raised by successful startups in total?

SELECT SUM(funding_total_usd)
FROM df_train
WHERE status = "acquired";
1 records
SUM(funding_total_usd)
15279949206

In total, the successful startups in the training data raised about 15.3 billion USD.
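To attribute the funding sum to individual state and category combinations, the query needs a GROUP BY over both columns; a sketch of such a query (results not shown here):

```sql
-- funding total per category/state combination, best-funded first
SELECT category_code, state_code, SUM(funding_total_usd) AS total_funding
FROM df_train
WHERE status = 'acquired'
GROUP BY category_code, state_code
ORDER BY total_funding DESC
LIMIT 1;
```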

In which categories are most startups founded?

SELECT category_code, COUNT(*) as count
FROM df_train
GROUP BY category_code
ORDER BY count DESC;
Displaying records 1 - 10
category_code count
software 117
web 111
mobile 56
enterprise 53
advertising 45
games_video 37
semiconductor 27
biotech 26
network_hosting 23
ecommerce 21

Most startups are in the software and web business. There are only very few startups in categories such as sports, hospitality, health and automotive.
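The counts above include startups of every status; to count only the successful ones per category, a WHERE filter is needed — a sketch (results not shown here):

```sql
-- acquired startups per category, most frequent first
SELECT category_code, COUNT(*) AS count
FROM df_train
WHERE status = 'acquired'
GROUP BY category_code
ORDER BY count DESC;
```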

In which states are most startups founded?

SELECT state_code, COUNT(*) as count
FROM df_train
GROUP BY state_code
ORDER BY count DESC
LIMIT 4;
4 records
state_code count
CA 365
NY 77
MA 63
WA 34

Most startups are founded in California, followed by New York and Massachusetts.

How many successful vs. unsuccessful startups have Round A funding?

SELECT COUNT(*)
FROM df_train
WHERE has_roundA = 1 AND status = "acquired";
1 records
COUNT(*)
256
SELECT COUNT(*)
FROM df_train
WHERE has_roundA = 1 AND status = "closed";
1 records
COUNT(*)
97
SELECT COUNT(*)
FROM df_train
WHERE has_roundA = 1 AND status = "acquired" and age_first_funding_year < 2;
1 records
COUNT(*)
202
SELECT COUNT(*)
FROM df_train
WHERE has_roundA = 1 AND status = "closed" and age_first_funding_year < 2;
1 records
COUNT(*)
66

256 startups that have Round A funding are successful, 97 are not. Looking at young startups that received their first funding before turning two years old, 202 succeeded and 66 failed. This shows that most startups are younger than two years when receiving their Round A funding, but it is still hard to predict whether they will succeed.
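The four counts above could also be computed in a single grouped query; a sketch (relying on SQLite evaluating boolean expressions as 0/1):

```sql
-- Round A startups per status, plus how many of them were funded
-- before turning two years old
SELECT status,
       COUNT(*) AS n_roundA,
       SUM(age_first_funding_year < 2) AS n_young
FROM df_train
WHERE has_roundA = 1
GROUP BY status;
```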

Now looking at the same comparison for Round B funding.

SELECT COUNT(*)
FROM df_train
WHERE has_roundB = 1 AND status = "acquired";
1 records
COUNT(*)
213
SELECT COUNT(*)
FROM df_train
WHERE has_roundB = 1 AND status = "closed";
1 records
COUNT(*)
62

213 startups that receive Round B funding succeed and 62 fail. As these numbers are not much lower than for Round A, many startups that received Round A funding also get Round B funding. The ratio of succeeding to failing startups rises from about 2.6:1 in Round A to about 3.4:1 in Round B, so the investment is already more certain.

Having a closer look at the next funding round, Round C.

SELECT COUNT(*)
FROM df_train
WHERE has_roundC = 1 AND status = "acquired";
1 records
COUNT(*)
135
SELECT COUNT(*)
FROM df_train
WHERE has_roundC = 1 AND status = "closed";
1 records
COUNT(*)
33

These numbers are considerably lower: only 135 startups with Round C funding succeed and 33 fail.

And having a closer look at the last funding round, Round D.

SELECT COUNT(*)
FROM df_train
WHERE has_roundD = 1 AND status = "acquired";
1 records
COUNT(*)
54
SELECT COUNT(*)
FROM df_train
WHERE has_roundD = 1 AND status = "closed";
1 records
COUNT(*)
9

Comparably few startups make it to Round D, only 54 succeed and 9 fail.

Other investment options are also considered. First having a look at the success of startups that have venture capital.

SELECT COUNT(*)
FROM df_train
WHERE has_vc = 1 AND status = "acquired";
1 records
COUNT(*)
139
SELECT COUNT(*)
FROM df_train
WHERE has_vc = 1 AND status = "closed";
1 records
COUNT(*)
85

139 startups that have venture capital succeed and 85 fail. This ratio of about 1.6:1 is worse than for the funding rounds.

How about business angel investments?

SELECT COUNT(*)
FROM df_train
WHERE has_angel = 1 AND status = "acquired";
1 records
COUNT(*)
93
SELECT COUNT(*)
FROM df_train
WHERE has_angel = 1 AND status = "closed";
1 records
COUNT(*)
77

The ratio of failing to succeeding startups with a business angel investment is the worst of all investment options considered: 77 fail and 93 succeed.
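The counts quoted above can be condensed into success rates per investment type; a small base-R sketch, with the numbers copied directly from the queries above:

```r
# Success counts per investment type, taken from the SQL results above
counts <- data.frame(
  type     = c("roundA", "roundB", "roundC", "roundD", "VC", "angel"),
  acquired = c(256, 213, 135, 54, 139, 93),
  closed   = c(97, 62, 33, 9, 85, 77)
)
# Share of startups with this investment type that were acquired
counts$success_rate <- round(counts$acquired / (counts$acquired + counts$closed), 2)
counts[order(-counts$success_rate), ]
```

The ordering confirms the narrative: each later funding round has a higher success rate (Round A 0.73 up to Round D 0.86), while venture capital (0.62) and especially angel investments (0.55) rank lowest.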

How many funding rounds do successful vs. failed startups have?

SELECT AVG(funding_rounds)
FROM df_train
WHERE status = "acquired";
1 records
AVG(funding_rounds)
2.525727
SELECT AVG(funding_rounds)
FROM df_train
WHERE status = "closed";
1 records
AVG(funding_rounds)
1.942623

Successful startups have more funding rounds than failed ones, which makes sense, because most startups go through several funding rounds before being acquired.

How does the duration between achieving major milestones differ between successful vs. failed startups?

SELECT AVG(age_last_milestone_year - age_first_milestone_year) as avg_milestoneduration
FROM df_train
WHERE status = "acquired";
1 records
avg_milestoneduration
1.939018
SELECT AVG(age_last_milestone_year - age_first_milestone_year) as avg_milestoneduration
FROM df_train
WHERE status = "closed";
1 records
avg_milestoneduration
1.084851

The average milestone duration for successful startups, at roughly 2 years between first and last milestone, is almost twice that of failed ones.

2.3.2 Categorical Features

For every categorical feature, the count of each category is computed and the 10 most frequent categories are presented.

n <- 10  # Number of most frequent levels to show
for (i in feat_cat){
  
  counts <- df_train %>% count(!!sym(i)) %>% arrange(desc(n)) %>% head(n)
  
  p <- ggplot(counts, aes(x = .data[[i]], y = n)) +
    geom_bar(stat = "identity")
  
  plot(p)
}

## Warning: Removed 1 rows containing missing values (`position_stack()`).

The most common state is California by far, followed by New York and Massachusetts. This is also reflected in the zip codes, as the most frequent ones start with 9, which indicates CA. The most common cities are San Francisco and New York. It will be interesting to look at a visualization of this geographical data later. There is one duplicate id and name that has to be deleted. The founding dates are quite evenly spread out between 1999 and 2009. The closing dates of failed startups peak in 2012 and 2013, and the first startup closed already in 2009. The first funding rounds took place from 2005 to 2008 and the last funding rounds from 2006 until 2012, with most last funding rounds in 2008. The most common categories of startups are web and software.

2.3.3 Numerical Features

After analyzing single categorical features, the same is done for numerical features.

# Removing one hot encoded features
df_numeric <- df_train %>% 
  select(all_of(feat_num))
df_numeric <- df_numeric %>%
  select(-is_CA, -is_NY, -is_MA, -is_TX, -is_otherstate, -is_software, -is_web,
         -is_mobile, -is_enterprise, -is_advertising, -is_gamesvideo, -is_ecommerce,
         -is_biotech, -is_consulting, -is_othercategory, -has_VC, -has_angel,
         -has_roundA, -has_roundB, -has_roundC, -has_roundD, -is_top500)

# Create histograms using ggplot2
hist_list <- lapply(names(df_numeric), function(col) {
  ggplot(df_numeric, aes(x = .data[[col]])) +
  geom_histogram(bins = 10) +
  ggtitle(paste0(col))
})

# Arrange histograms in a grid
library(patchwork)
hist_list[[1]] + hist_list[[2]] + hist_list[[3]] + plot_layout(ncol = 3)

hist_list[[4]] + hist_list[[5]] + hist_list[[6]] + plot_layout(ncol = 3)
## Warning: Removed 112 rows containing non-finite values (`stat_bin()`).
## Removed 112 rows containing non-finite values (`stat_bin()`).

hist_list[[7]] + hist_list[[8]] + hist_list[[9]] + plot_layout(ncol = 3)

The histogram for longitude verifies that the maximum value is an outlier. There are two local maxima, one around -120° and another around -80°; these are the West and East Coasts of the United States. The latitude of most startups is between 35° and 40°. For age_first_funding_year, age_last_funding_year, age_first_milestone_year and age_last_milestone_year, the assumption of outliers below 0 and above 10 is confirmed as well. The distributions of relationships and funding_rounds look as expected. The maximum of funding_total_usd is an even more significant outlier than expected, as the histogram only has one visible bin on the very left. This needs to be removed.

2.3.4 Relationships

The first relationship analyzed is between the numerical features. Correlation is a positive or negative linear relationship between two variables: +1 and -1 indicate the strongest possible linear correlation and 0 indicates no correlation. As our label is categorical, it is not possible to identify correlations to it with this method. But if the correlation between features in the model is too high, this might lead to overfitting. This method helps to identify those features and evaluate which to include in the model in the feature engineering chapter.
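As an aside, the Spearman method used below works on ranks, so it also captures monotone non-linear relationships that the Pearson coefficient understates; a minimal base-R illustration with made-up data:

```r
x <- 1:10
y <- x^3  # perfectly monotone, but not linear

cor(x, y, method = "pearson")   # below 1, penalizes the curvature
cor(x, y, method = "spearman")  # exactly 1, the ranks agree perfectly
```

This is why Spearman is a reasonable default here even if some feature relationships are only monotone rather than strictly linear.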

df_numeric %>% 
  vis_cor(cor_method = "spearman", na_action = "pairwise.complete.obs")

There is a strong positive correlation (dark brown) between age_first_funding_year and age_last_funding_year as well as between age_first_milestone_year and age_last_milestone_year. Another strong positive correlation exists between funding_total_usd and age_last_funding_year. Also milestones and relationships are positively correlated. There are no strong negative correlations; the strongest is between milestones and age_first_funding_year.

To make these results more precise, the correlation coefficients are now displayed. In addition, we have a look at the graphs of the relationships to identify any non-linear relationships as well.

df_numeric %>% 
  ggscatmat() +
  theme(axis.text.x = element_text(angle = 90, vjust = -1))
## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the GGally package.
##   Please report the issue at <https://github.com/ggobi/ggally/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 112 rows containing non-finite values (`stat_density()`).
## Removed 112 rows containing non-finite values (`stat_density()`).

No non-linear relationships are identified. The strong positive correlation between the funding and milestone age variables is visible in the scatterplots and verified by correlations > 0.5.

At this point, further relationships between categorical and numerical features are visualized. Do successful startups have more or fewer relationships?

df_train %>%
  ggplot(aes(x = status, y = relationships, color = status)) +
  geom_boxplot() +
  labs(title = "Relationships, if successful or not") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Successful startups have more relationships than failed ones. Only 25% of the successful startups have fewer than 7 relationships, whereas 75% of failed startups do.

Expanding on this: how is the number of funding_rounds related to the status and relationships?

df_train %>%
  ggplot(aes(x = relationships, y = funding_rounds, group = status, color = status)) +
  geom_point() +
  theme_classic(base_size = 12) +
  ggtitle("Correlation of status, relationships and funding_rounds") +
  theme(legend.title = element_blank())

Acquired startups have more relationships and go through more funding rounds.

As discussed earlier, it is easier to visualize longitude and latitude than to talk about their values. On the map the locations of startups are shown, distinguished by status through color.

world_map <- map_data("world")

ggplot(df_train, aes(x = longitude, y = latitude, color = status)) +
  geom_point() +
  geom_path(data = world_map, aes(x = long, y = lat, group = group), color = "gray50") +
  labs(x = "Longitude", y = "Latitude", title = "Startup Locations")

There are a few startups located in Europe, but most are located in the US. Many acquired startups are rather on the East Coast of the US.

Another map visualization.

library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
qmplot(x = longitude, 
       y = latitude, 
       data = df_train, 
       geom = "point", 
       color = status,
       alpha = 0.4) +
  scale_alpha(guide = 'none') # don't show legend for alpha
## ℹ Using `zoom = 4`
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.

Do successful startups have their first funding round earlier or later?

df_train %>%
  ggplot(aes(x = status, y = age_first_funding_year, color = status)) +
  geom_boxplot() +
  labs(title = "Age first funding, if successful or not") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

The first funding round happens around the same time for most startups, whether successful or not.

2.4 Define Schema

As a single dataframe is used for this data science lifecycle, the need for a schema is low. Since the quality of the data seems quite high and the dataframe is well structured, no schema is defined for this project.

2.5 Anomaly Detection

An anomaly identified in the data analysis is one startup duplicated by name and id. This should be removed.

# remove duplicate rows
train_data <- distinct(train_data)

Other anomalies are missing values and outliers, which are handled now.

2.5.1 Missing Values

At first, we need to find out how many missing values there are in which column. They will be fixed in the feature engineering recipe later on.

is.na(train_data) %>% 
    colSums()
##               state_code                 latitude                longitude 
##                        0                        0                        0 
##                 zip_code                       id                     city 
##                        0                        0                        0 
##                     name               founded_at                closed_at 
##                        0                        0                      442 
##         first_funding_at          last_funding_at   age_first_funding_year 
##                        0                        0                        0 
##    age_last_funding_year age_first_milestone_year  age_last_milestone_year 
##                        0                      112                      112 
##            relationships           funding_rounds        funding_total_usd 
##                        0                        0                        0 
##               milestones                    is_CA                    is_NY 
##                        0                        0                        0 
##                    is_MA                    is_TX            is_otherstate 
##                        0                        0                        0 
##            category_code              is_software                   is_web 
##                        0                        0                        0 
##                is_mobile            is_enterprise           is_advertising 
##                        0                        0                        0 
##            is_gamesvideo             is_ecommerce               is_biotech 
##                        0                        0                        0 
##            is_consulting         is_othercategory                   has_VC 
##                        0                        0                        0 
##                has_angel               has_roundA               has_roundB 
##                        0                        0                        0 
##               has_roundC               has_roundD                is_top500 
##                        0                        0                        0 
##                   status 
##                        0

There are 442 missing values in closed_at, probably because successful startups haven't closed and thus don't have a closing date. This feature is too tightly related to the prediction and therefore needs to be removed for the model. There are 112 missing values each for age_first_milestone_year and age_last_milestone_year. As there are only a couple hundred startups in the dataframe, the ratio of missing values is too high to simply delete these rows. Instead, the missing values will be replaced by the median. With this many values replaced, the feature's importance in the model will decrease.
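Median imputation, as it will later be applied via the recipe, can be sketched in base R with a toy vector (the values are made up):

```r
x <- c(1.5, NA, 3.2, NA, 0.8)
# median of the observed values 0.8, 1.5, 3.2 is 1.5
x[is.na(x)] <- median(x, na.rm = TRUE)
x
# → 1.5 1.5 3.2 1.5 0.8
```

In the actual pipeline this is handled by `step_impute_median()`, which learns the median on the training data only and applies it to new data.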

2.5.2 Outlier Detection

In the data analysis, a few possible outliers have been called out and are investigated now.

A longitude greater than -50, and thus a location in Europe, was identified as an outlier. How many startups are located there?

SELECT COUNT(*)
FROM df_train
WHERE longitude > -50;
1 records
COUNT(*)
4

There are 4 startups in Europe in the dataframe. This is a very small amount, but as this dataframe will be expanded in the future with information from our customers, all locations are included from the beginning.

Another outlier is the negative first and last funding age. How many of those are there in the dataframe?

SELECT name, age_first_funding_year, age_last_funding_year
FROM df_train
WHERE age_first_funding_year < 0 OR age_last_funding_year < 0;
Displaying records 1 - 10
name age_first_funding_year age_last_funding_year
Behavio -0.0301 -0.0301
MoPub -0.0219 2.0329
GrandCentral -0.2466 -0.2466
Powerset -0.4192 0.6658
Zing Systems -0.4986 -0.4137
PayDivvy -0.1370 3.1452
Where -0.0027 5.0055
Storify -0.4466 1.1479
Surphace -0.1671 0.9151
The Huffington Post -0.0219 5.2329

There are 35 startups with either a negative first or last funding age. These are too many to remove. If this project were extended, the root cause of a negative funding age should be identified. Another realization from this analysis is that all startups with a negative last funding age also have a negative first funding age.

The first and last funding age are rather right skewed, so the higher values are now evaluated as potential outliers.

SELECT name, age_first_funding_year, age_last_funding_year
FROM df_train
WHERE age_first_funding_year > 5 OR age_last_funding_year > 5;
Displaying records 1 - 10
name age_first_funding_year age_last_funding_year
TriCipher 5.1260 9.9973
ClairMail 1.6466 6.7616
Stratavia 4.6658 8.9973
MeeVee 4.5918 7.1726
Lockdown Networks 4.1534 5.1671
Peer39 0.0000 5.9041
Mashery 0.6712 6.2329
Handmark 5.2027 6.1890
SchemaLogic 4.0466 9.3041
Intela 3.9616 6.5205

The higher first and last funding ages are no outliers; one third of the startups have either their first or last funding after 5 years.

Another outlier is the negative first and last milestone age. How many of those are there in the dataframe?

SELECT name, age_first_milestone_year, age_last_milestone_year
FROM df_train
WHERE age_first_milestone_year < 0 OR age_last_milestone_year < 0;
Displaying records 1 - 10
name age_first_milestone_year age_last_milestone_year
MeeVee -0.4986 12.6795
Qwiki -0.5863 3.7534
Sentrigo -0.7479 4.4767
GrandCentral -0.4082 7.4548
Meridian -14.1699 2.2082
Tracelytics -0.0822 1.7205
Zing Systems -0.4986 5.7699
Socialize -2.4986 4.7205
PayDivvy -2.0932 4.4082
Airwide Solutions -1.9151 -1.9151

Just as with the negative first and last funding age, there are 35 startups with either a negative first or last milestone age. Are those the same startups?

SELECT COUNT(*)
FROM df_train
WHERE age_first_milestone_year < 0 OR age_last_milestone_year < 0 OR age_first_funding_year < 0 OR age_last_funding_year < 0;
1 records
COUNT(*)
62

The startups with negative milestone ages and those with negative funding ages mostly differ. If the two sets were identical, the combined count would be 35; if they were fully disjoint, it would be 70. The actual count of 62 implies an overlap of only 8 startups.
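The overlap follows from inclusion-exclusion on the two counts of 35 and the combined count of 62 from the query above:

```r
neg_funding   <- 35  # startups with a negative funding age
neg_milestone <- 35  # startups with a negative milestone age
combined      <- 62  # startups with either (from the query above)

# |A ∩ B| = |A| + |B| - |A ∪ B|
overlap <- neg_funding + neg_milestone - combined
overlap
# → 8
```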

The variable relationships is right skewed. How many relationships are considered an outlier?

# boxplot relationships
df_train %>%
  ggplot(aes(y = relationships)) +
  geom_boxplot()

Startups having more than 20 relationships are outliers.
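The boxplot flags as outliers any points above Q3 + 1.5 * IQR; a minimal sketch of that rule with toy data (not the actual relationships column):

```r
v <- c(rep(1:10, times = 5), 40)             # toy data with one extreme value
q <- quantile(v, c(0.25, 0.75))              # Q1 and Q3
upper_fence <- unname(q[2] + 1.5 * diff(q))  # Q3 + 1.5 * IQR
v[v > upper_fence]                           # values the boxplot marks as outliers
# → 40
```

Applying the same rule to `df_train$relationships` reproduces the threshold of roughly 20 read off the boxplot.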

The maximum funding_total_usd is tremendously higher than all other funding amounts. Should this be removed?

# boxplot funding_total_usd
df_train %>%
  ggplot(aes(y = funding_total_usd)) +
  geom_boxplot()

The other outliers are far lower than the maximum funding_total_usd, which is why the startups with extremely high funding have to be removed from the dataframe.

Funding rounds above 4 seem unreasonable and are thus investigated.

# boxplot funding_rounds
df_train %>%
  ggplot(aes(y = funding_rounds)) +
  geom_boxplot()

There are only 3 startups with more than 6 funding rounds.

The earliest founding date, in the 1980s, is very early in comparison to the rather left-skewed distribution of founding dates. Does it make sense to remove it?

# histogram founded_at
df_train %>%
  ggplot(aes(x = founded_at)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is a huge gap between the earliest founded_at and the next one in the late 1990s.

2.6 Feature Engineering

Feature engineering is the process of using domain knowledge to extract meaningful features from the data. The goal of this process is to keep those features that improve the predictions of our model. For this, a pipeline is created that includes the standardization of numeric features and one-hot encoding of categorical features. As the categories and states are already one-hot encoded, their categorical features need to be removed. The anomalies, like outliers, missing values or wrong data types, will be fixed in the recipe in the next chapter.

3 Model

This chapter describes the creation of a model to predict the success of a startup. Several classification algorithms are optimized and compared to select the best performing model.

3.1 Select Algorithm

As the data is pre-categorized, with status as the label to be predicted, supervised learning in the form of classification is used. As the data is not too complex, neural networks are not needed. Instead, two common classification algorithms are compared: - Logistic Regression: Predicts a dependent variable by analyzing the relationship between one or more existing independent variables. - Gradient Boosted Trees: An ensemble of decision trees, where each tree uses a tree-like model of decisions and their possible consequences and successive trees correct the errors of the previous ones.
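Logistic regression models the probability of the positive class through the logistic (sigmoid) function of a linear combination of the features:

\[ P(\text{acquired} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}} \]

A startup is then predicted as acquired when this probability exceeds a threshold, typically 0.5.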

3.2 Model Training & Tuning

Models for both algorithms are created and optimized in this chapter.

3.2.1 Feature Selection

At first, we need to decide which features we want to include in our model. As identified in the correlation matrix, there is a strong positive linear correlation between the first and last funding and milestone age of startups. These highly correlated features are removed and this is included in the recipe of Feature Engineering already.

df_rec <-
  recipe(status ~ ., data = train_data) %>%
  step_rm(closed_at, state_code, category_code, latitude, longitude, zip_code, city, name, id) %>% #remove unnecessary columns
  step_impute_median(age_first_milestone_year, age_last_milestone_year) %>% #replace NaN with median
  step_naomit(everything(), skip = TRUE) %>% #remove rows with NaN values (only as double check)
  step_mutate(founded_at = as.numeric(founded_at)) %>% #date to numeric
  step_mutate(first_funding_at = as.numeric(first_funding_at)) %>%
  step_mutate(last_funding_at = as.numeric(last_funding_at)) %>%
  step_corr(all_predictors(), threshold = 0.7, method = "spearman") %>% #remove highly correlated
  #step_outliers_maha(all_numeric(), -all_outcomes()) %>%
  #step_outliers_lookout(all_numeric(),-contains(r"(.outliers)"),-all_outcomes()) %>% 
  #step_outliers_remove(contains(r"(.outliers)")) %>% # remove outliers
  step_zv(all_numeric(), -all_outcomes()) %>% #remove all numeric that have zero variance
  step_normalize(all_numeric(), -all_outcomes()) #standardize numeric features
# bake recipe
prepped_data <- 
  df_rec %>% # use the recipe object
  prep() %>% # perform the recipe on training data
  juice() # extract only the preprocessed dataframe 

The outlier removal could not be made to work, which could have a negative impact on the model performance.

From the training data, a cross-validation split is performed to create validation folds. These are used to verify the model on data it was not trained on and thus to detect overfitting to the training data.

# crossvalidation
set.seed(100)

cv_folds <-
 vfold_cv(train_data, 
          v = 5, 
          strata = status) 

3.2.2 Training & Hyperparameter Tuning

In the first phase of the modeling process, the initial models are generated and their performance is compared during model evaluation. Besides the model specification, workflows need to be created to combine the data preparation recipe with the model. As a third step, the models are fitted on the cross-validation folds (cv_folds) to estimate their performance afterwards.

# Logistic Regression
# model
log_spec <- # your model specification
  logistic_reg() %>%  # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification") # model mode

# workflow pipeline
log_wflow <- # new workflow object
 workflow() %>% # use workflow function
 add_recipe(df_rec) %>%   # use the new recipe
 add_model(log_spec)   # add your model spec

# fit model with crossvalidation
log_res <- 
  log_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(yardstick::recall, yardstick::precision, accuracy),
    control = control_resamples(save_pred = TRUE)
    ) 
## → A | warning: prediction from a rank-deficient fit may be misleading
## There were issues with some computations   A: x5
# Gradient Boosted Trees (XGBoost)
# model
xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") 

# workflow pipeline
xgb_wflow <-
 workflow() %>%
 add_recipe(df_rec) %>% 
 add_model(xgb_spec)

# fit model with crossvalidation
xgb_res <- 
  xgb_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(yardstick::recall, yardstick::precision, accuracy),
    control = control_resamples(save_pred = TRUE)
    ) 

3.2.3 Evaluation

Now the results of the models are compared and their performance evaluated to decide which model is the best for our startup success prediction use case.

Average performance over all folds for logistic regression.

log_res %>% 
  collect_metrics(summarize = TRUE)
## # A tibble: 3 × 6
##   .metric   .estimator  mean     n std_err .config             
##   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy  binary     0.731     5 0.0142  Preprocessor1_Model1
## 2 precision binary     0.760     5 0.0130  Preprocessor1_Model1
## 3 recall    binary     0.855     5 0.00501 Preprocessor1_Model1

We have a closer look at three metrics: accuracy, precision and recall. Accuracy rates the overall correct classifications. Precision describes how many of the samples classified as positive are actually positive. Recall describes how many of the truly positive samples have been classified as positive.

The logistic regression model has an accuracy of 73%. 76% of the startups it predicted to be successful are actually successful, and 85% of the truly positive samples have been classified as positive.

Performance for every single fold of the logistic regression.

log_res %>% 
  collect_metrics(summarize = FALSE)
## # A tibble: 15 × 5
##    id    .metric   .estimator .estimate .config             
##    <chr> <chr>     <chr>          <dbl> <chr>               
##  1 Fold1 recall    binary         0.867 Preprocessor1_Model1
##  2 Fold1 precision binary         0.757 Preprocessor1_Model1
##  3 Fold1 accuracy  binary         0.734 Preprocessor1_Model1
##  4 Fold2 recall    binary         0.844 Preprocessor1_Model1
##  5 Fold2 precision binary         0.768 Preprocessor1_Model1
##  6 Fold2 accuracy  binary         0.734 Preprocessor1_Model1
##  7 Fold3 recall    binary         0.854 Preprocessor1_Model1
##  8 Fold3 precision binary         0.752 Preprocessor1_Model1
##  9 Fold3 accuracy  binary         0.725 Preprocessor1_Model1
## 10 Fold4 recall    binary         0.865 Preprocessor1_Model1
## 11 Fold4 precision binary         0.802 Preprocessor1_Model1
## 12 Fold4 accuracy  binary         0.775 Preprocessor1_Model1
## 13 Fold5 recall    binary         0.843 Preprocessor1_Model1
## 14 Fold5 precision binary         0.721 Preprocessor1_Model1
## 15 Fold5 accuracy  binary         0.686 Preprocessor1_Model1

The range of each metric across the folds is not too big; the values are rather similar for every fold.

Now, we obtain the actual model predictions to compare the numbers of correct and false predictions.

log_pred <- 
  log_res %>%
  collect_predictions()

A confusion matrix is created to compare the numbers of predictions. How to read it: the rows show the predicted class and the columns the true class, so the diagonal contains the correct classifications and the off-diagonal cells the errors.

From these numbers, various metrics can be calculated.

Accuracy: The rate of overall correct classifications.

\[ ACC=\frac{TP+TN}{FP+FN+TP+TN} \]

Precision: How many of the samples that have been classified as positive are actually positive.

\[ PRE=\frac{TP}{FP+TP} \]

Recall: How many of the truly positive samples have been classified as positive.

\[ REC=\frac{TP}{FN+TP} \]

For the prediction of startup success, we want the number of False Positives to be as low as possible, because these represent investments gone wrong and a lot of money lost for the investors; we therefore focus on precision.

log_pred %>%
  conf_mat(status, .pred_class) 
##           Truth
## Prediction acquired closed
##   acquired      382    121
##   closed         65    123

The confusion matrix can also be visualized to make reading it easier.

log_pred %>%
  conf_mat(status, .pred_class) %>% 
  autoplot(type = "heatmap")

382 startups are True Positives, as they have been predicted to be acquired and have also been acquired in reality. On the other hand, 123 startups that closed have also been predicted to close, those are the True Negatives. There are 65 False Negatives, that have been acquired, but were predicted to close. The fourth kind are the False Positives, those have closed in reality, but were predicted to be acquired. There are 121 False Positives in this logistic regression and this is the number that should be as low as possible.
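Plugging the confusion matrix counts into the formulas above reproduces the cross-validation metrics; a quick base-R check:

```r
TP <- 382; FP <- 121; FN <- 65; TN <- 123  # counts from the confusion matrix above

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# → accuracy 0.731, precision 0.759, recall 0.855
```

These pooled values differ minimally from the fold-averaged means reported by `collect_metrics()` (e.g. 0.759 vs. 0.760 for precision), since averaging over folds is not identical to pooling all predictions.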

Now we will have a look at the metrics of the other algorithm, Gradient Boosted Trees.

xgb_res %>% 
  collect_metrics(summarize = TRUE)
## # A tibble: 3 × 6
##   .metric   .estimator  mean     n std_err .config             
##   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy  binary     0.760     5  0.0134 Preprocessor1_Model1
## 2 precision binary     0.779     5  0.0137 Preprocessor1_Model1
## 3 recall    binary     0.879     5  0.0123 Preprocessor1_Model1

The Gradient Boosted Tree has an accuracy of 76%, recall of 88% and most importantly a precision of 78%.

Looking in more detail at the correct and false predictions of the Gradient Boosted Trees.

xgb_pred <- 
  xgb_res %>%
  collect_predictions()
xgb_pred %>%
  conf_mat(status, .pred_class) %>% 
  autoplot(type = "heatmap")

There are 112 False Positives, so startups that have been predicted to be acquired, but actually closed.

Instead of looking at the metrics one by one, we compare them now.

# compare models
log_metrics <- 
  log_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Logistic Regression") # add the name of the model to every row

xgb_metrics <- 
  xgb_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "XGBoost")

# create dataframe with all models
model_compare <- bind_rows(log_metrics,
                           xgb_metrics) 

# change data structure
model_comp <- 
  model_compare %>% 
  select(model, .metric, mean, std_err) %>% 
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show mean accuracy for every model
model_comp %>% 
  arrange(mean_accuracy) %>% 
  mutate(model = fct_reorder(model, mean_accuracy)) %>% # order results
  ggplot(aes(model, mean_accuracy, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") +
   geom_text(
     size = 3,
     aes(label = round(mean_accuracy, 2)),
     vjust = 1
  )

# show mean precision per model
model_comp %>% 
  arrange(mean_precision) %>% 
  mutate(model = fct_reorder(model, mean_precision)) %>%
  ggplot(aes(model, mean_precision, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") + 
     geom_text(
     size = 3,
     aes(label = round(mean_precision, 2)),
     vjust = 1
  )

The accuracy and precision of the Gradient Boosted Trees are a little higher, so this is the model with the best performance. The objective for a good model is 80% for accuracy and precision; those are not quite reached, but they are close.

3.3 Evaluate Model

The best model is evaluated on its errors to understand why it makes them and how to fix them.

# obtain the misclassified predictions
# (subtracting two factors is not meaningful in R, so instead of a numeric
# residual the rows where prediction and truth disagree are filtered)
wrong_predictions <- 
  xgb_pred %>% 
  filter(status != .pred_class)

# take a look at the observations
errors <- 
  train_data %>% 
  dplyr::slice(wrong_predictions$.row) 

The misclassified startups can now be inspected for common patterns. Ranking them by how "wrong" they are would additionally require the predicted class probabilities.

4 Deployment

The last chapter describes the deployment of the final model.

4.1 Validate Model

The best performing model - Gradient Boosted Trees - is now evaluated on the test data.

last_fit_xgb <- last_fit(xgb_wflow, 
                        split = data_split,
                        metrics = metric_set(yardstick::recall, yardstick::precision, accuracy)
                        )
last_fit_xgb %>% 
  collect_metrics()
## # A tibble: 3 × 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 recall    binary         0.913 Preprocessor1_Model1
## 2 precision binary         0.801 Preprocessor1_Model1
## 3 accuracy  binary         0.797 Preprocessor1_Model1

The results on the test data are even a little better than on the training data. The Gradient Boosted Trees reach an accuracy of almost 80% and a precision of 80%. The recall, which is not as important for this use case, is even at 91%.

last_fit_xgb %>%
  collect_predictions() %>% 
  conf_mat(status, .pred_class) %>% 
  autoplot(type = "heatmap")

Looking at the confusion matrix, there are 137 True Positives, compared to 34 startups that were acquired but predicted to close.

Which features are the most important for startup success prediction?

# feature importance
last_fit_xgb %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

The most important features are the number of relationships, the total funding and the age at first funding.

4.2 Deploy Model

There are three major deployment options: Posit Connect, TensorFlow (TFX) and Plumber.

To deploy the model, a connection to Posit Connect was attempted. This did not work; instead, the following error appeared when trying to install Posit Connect through the terminal as described here:

(base) ~ % curl -Lo rsc-installer.sh https://cdn.rstudio.com/connect/installer/installer-v6.4.0.sh
(base) ~ % sudo -E bash ./rsc-installer.sh 2023.03.0

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23.7M  100 23.7M    0     0  7742k      0  0:00:03  0:00:03 --:--:-- 7753k
Password:
helena.schick is not in the sudoers file. This incident will be reported.

I tried the following code before realizing that I could not make it work:

Fit models

xgb_fit <- fit(xgb_wflow, data = train_data)
xgb_fit

model_name <- "xgb_startup_success_prediction"
pin_name <- glue("helena.schick/{model_name}")

date_metadata <- list(
  train_dates = c(
    as.character(min(train_data$founded_at)),
    as.character(max(train_data$founded_at))
  ),
  test_dates = c(
    as.character(min(test_data$founded_at)),
    as.character(max(test_data$founded_at))
  )
)
print(date_metadata)

# Create the vetiver model.
v <- vetiver_model(
  xgb_fit,
  model_name,
  versioned = TRUE,
  save_ptype = train_data %>% head(1) %>% select(-status),
  metadata = date_metadata
)
v

Use RStudio Connect as a board.

board <- pins::board_connect()
# Write the model to the board.
board %>% vetiver_pin_write(vetiver_model = v)

Add server

rsconnect::addServer(
  url = "https://colorado.rstudio.com/rsc/__api__",
  name = "colorado"
)
# Add account
if (FALSE) {
  rsconnect::connectApiUser(
    account = "helena.schick",
    server = "colorado",
    apiKey = Sys.getenv("CONNECT_API_KEY")
  )
}
# Deploy to Connect
vetiver_deploy_rsconnect(
  board = board,
  name = pin_name,
  appId = "11314",
  launch.browser = FALSE,
  appTitle = "Startup Predict - Model - API",
  predict_args = list(debug = FALSE),
  account = "helena.schick",
  server = "colorado"
)

Thus, RStudio’s Model Management could not be used.

As I have a MacBook with an M1 chip, on which TensorFlow does not run properly, I did not try TFX pipelines.

The last option is Plumber, which has to be run in an R script (see plumber.R). The model has to be saved so it can be loaded in that script.

saveRDS(last_fit_xgb, "model.rds")
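The contents of plumber.R are not reproduced here, so the following is only a minimal sketch of what such a script could look like. The endpoint path `/predict` is an assumption, and since model.rds stores the `last_fit()` result rather than a bare model, the sketch pulls the fitted workflow out with `extract_workflow()` first:

```r
# plumber.R -- illustrative sketch, not the project's actual API
library(plumber)
library(tidymodels)

# model.rds holds the last_fit() result; extract the fitted workflow from it
fitted_wflow <- extract_workflow(readRDS("model.rds"))

#* Predict startup status for new observations sent as JSON
#* @post /predict
function(req) {
  new_data <- as.data.frame(jsonlite::fromJSON(req$postBody))
  predict(fitted_wflow, new_data = new_data)
}
```

The API could then be started from R with `plumber::pr("plumber.R") %>% plumber::pr_run(port = 8000)`.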

4.3 Serve Model

For this project, serving the model cannot be carried out. If this became a real business case, however, the model would be served with information on new startups, from which it could continue to learn.

4.4 Monitor Model

As with serving the model, monitoring it cannot be carried out in this project. Over time, as new information is ingested, the performance of the model would be monitored. Ideally, model performance increases with the additional data; if it did not, the hyperparameters might need to be adjusted or a better performing model would have to be evaluated.
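One simple way to monitor performance over time is to compute the accuracy on newly scored startups per period. The sketch below assumes a prediction log with hypothetical columns `scored_at`, `status` and `.pred_class`; these names are not taken from the project:

```r
library(dplyr)

# Accuracy of logged predictions, grouped by calendar month
monitor_accuracy <- function(predictions) {
  predictions %>%
    mutate(month = format(scored_at, "%Y-%m")) %>%
    group_by(month) %>%
    summarise(
      n        = n(),
      accuracy = mean(status == .pred_class),
      .groups  = "drop"
    )
}

# Toy example
toy <- tibble(
  scored_at   = as.Date(c("2023-01-10", "2023-01-20", "2023-02-05")),
  status      = c("acquired", "closed", "acquired"),
  .pred_class = c("acquired", "acquired", "acquired")
)
monitor_accuracy(toy)
```

A sustained drop in the monthly accuracy would be the trigger for retuning the hyperparameters or re-evaluating the model choice, as described above.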